Crawling the Web 


Creating Data Indices 


Today’s lecture 


• Crawling
• Connectivity servers


Basic crawler operation 


• Begin with known "seed" URLs
• Fetch and parse them
  - Extract URLs they point to
  - Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
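
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB` and `fetch_and_parse` are hypothetical stand-ins for real fetching and parsing, and the `seen` set (which keeps a URL from being queued twice) is an addition not spelled out on the slide.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_and_parse(url):
    """Stand-in for fetching a page and extracting the URLs it points to."""
    return TOY_WEB.get(url, [])


def crawl(seeds):
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # URLs that have already been queued (assumption)
    order = []             # URLs in the order they were fetched
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_and_parse(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling `crawl(["http://a.example/"])` visits the four toy pages breadth-first, starting from the seed.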


Crawling picture 


[Figure: the Web divided into URLs crawled and parsed, the URL frontier, and the unseen Web]


Simple picture — complications 


• Web crawling isn't feasible with one machine
  - All of the above steps are distributed
• Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
• Even non-malicious pages pose challenges
  - Latency and bandwidth to remote servers vary
  - Webmasters' stipulations
    • How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
• Politeness: don't hit a server too often


What any crawler must do 


• Be polite: respect implicit and explicit politeness considerations
  - Only crawl allowed pages
  - Respect robots.txt (more on this shortly)
• Be robust: be immune to spider traps and other malicious behavior from web servers


What any crawler should do 


• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources


What any crawler should do 


• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of a previously fetched page
• Extensible: adapt to new data formats, protocols


Updated crawling picture 


[Figure: the Web divided into URLs crawled and parsed and the unseen Web, with crawling threads working between them]


URL frontier 


• Contains URLs to be crawled
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy


Explicit and implicit politeness 


• Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  - robots.txt
• Implicit politeness: even with no specification, avoid hitting any site too often


Robots.txt 


• Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  - www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
  - For a URL, create a file URL/robots.txt
  - This file specifies access restrictions


Robots.txt example 


• No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":


User-agent: * 
Disallow: /yoursite/temp/ 


User-agent: searchengine 
Disallow: 
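
One way to check these rules programmatically is Python's standard `urllib.robotparser` module. The sketch below feeds it the example file directly instead of fetching it; the host `example.com` and the agent name `otherbot` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if `agent` may fetch `path` on the placeholder host example.com."""
    return parser.can_fetch(agent, "http://example.com" + path)
```

Here `allowed("searchengine", "/yoursite/temp/a.html")` is permitted, while the same path is refused for any other robot name.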


Processing steps in crawling 


• Pick a URL from the frontier
• Fetch the document at the URL
• Parse the fetched document
  - Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  - If not, add to indexes
• For each extracted URL
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt)
  - Check if it is already in the frontier (duplicate URL elimination)


Basic crawl architecture 


[Figure: basic crawl architecture; recoverable labels: URL filter, Dup URL elim, URL set, URL Frontier]


DNS (Domain Name System)


• A lookup service on the Internet
  - Given a URL's host name, retrieve its IP address
  - Service provided by a distributed set of servers, so lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
• Solutions
  - DNS caching
  - Batch DNS resolver: collects requests and sends them out together
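
DNS caching can be illustrated with a small wrapper around whatever resolver the crawler uses. This is a minimal sketch with assumptions: `CachingResolver` is a hypothetical name, `FAKE_DNS` is a made-up table standing in for a real (slow) lookup, and cache expiry is ignored.

```python
class CachingResolver:
    """Remember host -> IP answers so each host is resolved only once."""

    def __init__(self, resolve):
        self._resolve = resolve   # function host -> IP; possibly slow
        self._cache = {}
        self.lookups = 0          # number of calls made to the real resolver

    def lookup(self, host):
        host = host.lower()       # host names are case-insensitive
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolve(host)
        return self._cache[host]


# Made-up addresses from the documentation range 192.0.2.0/24.
FAKE_DNS = {"a.example": "192.0.2.1", "b.example": "192.0.2.2"}
resolver = CachingResolver(FAKE_DNS.__getitem__)
```

In a real crawler the `resolve` argument could be a blocking call such as `socket.gethostbyname`; the cache then hides its latency for repeat hosts.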


Parsing: URL normalization 


• When a fetched document is parsed, some of the extracted links are relative URLs
• E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, must normalize (expand) such relative URLs
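
In Python, this expansion is available as `urllib.parse.urljoin`. The sketch below applies it to the Wikipedia example; the second call uses a hypothetical host to show a `..` path segment being resolved.

```python
from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"

# Resolve the relative link against the URL of the page it was found on.
absolute = urljoin(base, relative)

# Hypothetical second example: ".." climbs one directory level.
parent = urljoin("http://a.example/x/y.html", "../z.html")
```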


Content seen? 


• Duplication is widespread on the web (~30%)
• If the page just fetched is already in the index, do not process it further
• This is verified using document fingerprints or shingles
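
As a rough illustration of shingles, the sketch below treats each document as its set of word k-grams and compares two documents by Jaccard overlap. The choice k = 3, the whitespace tokenization, and the function names are assumptions for this example; a production system would hash the shingles and use a tuned similarity threshold.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


d1 = shingles("a rose is a rose is a rose")
d2 = shingles("a rose is a flower which is a rose")
similarity = jaccard(d1, d2)   # 3 shared shingles out of 7 distinct ones
```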




Filters and robots.txt 


• Filters: regular expressions for URLs to be crawled or not
• Once a robots.txt file is fetched from a site, need not fetch it repeatedly
  - Doing so burns bandwidth, hits web server
• Cache robots.txt files
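
A URL filter of this kind might look like the sketch below. The policy is purely hypothetical (only http/https URLs on .edu hosts, skipping a few binary file types) and is there to show the mechanism, not to recommend a rule set.

```python
import re
from urllib.parse import urlsplit

# Hypothetical policy, expressed as regular expressions.
ALLOW_HOST = re.compile(r"\.edu$")
DENY_PATH = re.compile(r"\.(jpg|png|gif|zip|exe)$", re.IGNORECASE)


def passes_filter(url):
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""   # hostname is already lowercased
    if not ALLOW_HOST.search(host):
        return False
    return DENY_PATH.search(parts.path) is None
```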


Duplicate URL elimination 


• For a non-continuous (one-shot) crawl, test to see if an extracted and filtered URL has already been passed to the frontier
• For a continuous crawl, see details of frontier implementation


Distributing the crawler 


• Run multiple crawl threads, under different processes, potentially at different nodes
  - Geographically distributed nodes
• Partition hosts being crawled into nodes
  - Hash used for partition
• How do these nodes communicate?


Communication between nodes 


• The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes
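
A hash-based host partition can be sketched as follows, so that every URL from the same host is routed to the same node. The function name and the use of SHA-256 are assumptions for illustration; the point is only that the hash must be stable across processes, which Python's built-in `hash()` on strings is not.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a node index in [0, num_nodes) based on its host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Because only the host is hashed, `http://a.example/x` and `http://a.example/y?z=1` land on the same node, which is what lets that node enforce politeness for the host.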


[Figure: distributed crawl architecture; recoverable labels: DNS, Doc, robots, other, URL]


URL frontier: two main considerations


• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others
  - E.g., pages (such as news sites) whose content changes often

These goals may conflict with each other.

(E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)


Politeness — challenges 


• Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host
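
The heuristic can be written as a one-line rule. In this sketch the multiplier `gap_factor=10.0` is an assumed value: the slide says only that the gap should be much larger than the last fetch time.

```python
def next_allowed_time(fetch_started, fetch_finished, gap_factor=10.0):
    """Earliest time the same host may be contacted again.

    The gap is a multiple (gap_factor, an assumed value) of how long
    the most recent fetch from that host took.
    """
    fetch_duration = fetch_finished - fetch_started
    return fetch_finished + gap_factor * fetch_duration
```

For a fetch that ran from t = 100.0 to t = 100.5 seconds, the host would not be contacted again before t = 105.5.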


URL frontier: Mercator scheme 


[Figure: Mercator URL frontier. URLs → Prioritizer → front queues → Biased front queue selector / Back queue router → back queues → Back queue selector → Crawl thread requesting URL]


Mercator URL frontier 


URLs flow in from the top into the frontier 
Front queues manage prioritization 

Back queues enforce politeness 

Each queue is FIFO 


Front queues 


[Figure: front queues, fed by the Prioritizer and drained by the Biased front queue selector / Back queue router]


Front queues 


• Prioritizer assigns to each URL an integer priority between 1 and K
  - Appends URL to corresponding queue
• Heuristics for assigning priority
  - Refresh rate sampled from previous crawls
  - Application-specific (e.g., "crawl news sites more often")


Biased front queue selector 


• When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
• This choice can be round robin biased to queues of higher priority, or some more sophisticated variant
  - Can be randomized
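
A randomized variant of the biased selector might look like this sketch. Two assumptions are made that the slides do not fix: a larger priority number means higher priority, and a non-empty queue is chosen with probability proportional to its priority number. `FrontQueues` is a hypothetical name.

```python
import random
from collections import deque


class FrontQueues:
    """K FIFO front queues with a randomized, priority-biased selector."""

    def __init__(self, k):
        self.queues = {p: deque() for p in range(1, k + 1)}

    def add(self, url, priority):
        self.queues[priority].append(url)

    def pull(self, rng=random):
        nonempty = [p for p, q in self.queues.items() if q]
        if not nonempty:
            return None
        # Assumed bias: weight each non-empty queue by its priority number.
        chosen = rng.choices(nonempty, weights=nonempty, k=1)[0]
        return self.queues[chosen].popleft()
```

Lower-priority queues are still served occasionally, so no URL waits forever, and each individual queue stays FIFO.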


Back queues 


[Figure: back queues, fed by the Biased front queue selector / Back queue router]


Back queue invariants 


• Each back queue is kept non-empty while the crawl is in progress
• Each back queue only contains URLs from a single host
  - Maintain a table from hosts to back queues


Back queue heap 


• One entry for each back queue
• The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again
• This earliest time is determined from
  - Last access to that host
  - Any time buffer heuristic we choose


Back queue processing 


A crawler thread seeking a URL to crawl:
• Extracts the root of the heap
• Fetches the URL at the head of the corresponding back queue q (look up from table)
• Checks if queue q is now empty; if so, pulls a URL v from the front queues
  - If there's already a back queue for v's host, append v to that host's back queue and pull another URL from the front queues; repeat
  - Else add v to q
• When q is non-empty, create a heap entry for it
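
The steps above can be sketched as a small single-threaded Python class. It is an illustration under stated assumptions, not Mercator's implementation: a plain list stands in for the front queues, the politeness gap is a fixed `delay`, times are plain numbers supplied by the caller, and the names `BackQueues` and `next_ok` are hypothetical.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class BackQueues:
    def __init__(self, front_urls, num_back_queues, delay):
        self.front = deque(front_urls)   # stand-in for the front queues
        self.delay = delay               # assumed fixed gap per host
        self.queues = {q: deque() for q in range(num_back_queues)}
        self.queue_of = {}               # table: host -> back queue id
        self.host_of = {}                # back queue id -> host
        self.next_ok = {}                # host -> earliest next hit time
        self.heap = []                   # entries (earliest time, queue id)
        for qid in self.queues:
            self._refill(qid)

    def _refill(self, qid):
        """Pull from the front until qid receives a URL for a new host."""
        while self.front:
            url = self.front.popleft()
            host = urlsplit(url).hostname
            if host in self.queue_of:
                # Host already has a back queue: append there, keep pulling.
                self.queues[self.queue_of[host]].append(url)
                continue
            self.queue_of[host] = qid
            self.host_of[qid] = host
            self.queues[qid].append(url)
            heapq.heappush(self.heap, (self.next_ok.get(host, 0), qid))
            return

    def next_url(self, now):
        """Return (url, time the fetch may start), or None if nothing is left."""
        if not self.heap:
            return None
        ready_at, qid = heapq.heappop(self.heap)   # root of the heap
        start = max(now, ready_at)
        url = self.queues[qid].popleft()
        self.next_ok[self.host_of[qid]] = start + self.delay
        if self.queues[qid]:
            heapq.heappush(self.heap, (start + self.delay, qid))
        else:
            del self.queue_of[self.host_of.pop(qid)]
            self._refill(qid)
        return url, start
```

With two back queues, a delay of 10, and the URLs a/1, a/2, b/1, c/1 arriving in that order, the second URL from host a is held until time 10 while b and c are served immediately.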


Number of back queues B 


• Keep all threads busy while respecting politeness
• Mercator recommendation: three times as many back queues as crawler threads


Web Search in 2020? 


Type keywords into a search box? 
Social or “human powered” search? 
The Semantic Web? 


Intelligent search/semantic search/natural 
language search? 


Intelligent Search 


Instead of merely retrieving Web pages, read ‘em! 


Machine Reading = Information Extraction + tractable 
inference 


Alan Smithson will give a talk at the UW database seminar on 
Friday Dec 5 


• IE(sentence) = who did what?
  - speaking(Alan Smithson, UW)
• Inference = uncover implicit information
  - Will Alan visit Seattle?


Application: Information Fusion 


• What kills bacteria?
• Which west coast, nano-technology companies are hiring?
• What is a quiet, inexpensive, 4-star hotel in Vancouver?


Opinion Mining 


OPINE (Popescu & Etzioni, EMNLP '05)
• IE(product reviews)
  - Informative
  - Abundant, but varied
  - Textual
• Summarize reviews without any prior knowledge of product category


[Screenshot: OPINE review summary page, shown in Windows Internet Explorer]


OPINE


Ana-Maria Popescu, Bao Nguyen, Oren Etzioni 




New York City hotels > Renaissance New York Hotel Times Square 


Review Summary 


Staff: excellent (7), great (3), very helpful (2), poor, fantastic, helpful, love, good, view all (17)
Location: great (4), best (3), good (2), fabulous, fantastic, ideal, superb, not great, love, view all (15)
Room: nice (5), great (2), not great (2), good (2), very nice (2), excellent, superb, lovely, average, view all (17)
Quality: best, fantastic, lovely, recommend, love, nice, fine, view all (7)
Food: very good (2), fantastic, lovely, not great, great, view all (6)
Bathroom beauty: beautiful
Bar: fabulous, great, view all (2)
Staff friendliness: friendly (4), very friendly (2), incredibly friendly, unfriendly, view all (8)
Room bed comfort: comfy (2), comfortable (2), extremely comfortable, view all (5)
Bathroom: great (2), elegant, very nice, nice, view all (5)
Room cleanness: clean (2)


User comments: 


the rooms were clean and smelled great . Read more 


The rooms were clean, spacious, soundproof and well-appointed . Read more 




OPINE 


Ana-Maria Popescu, Bao Nguyen, Oren Etzioni 


Review Summary

Canal house beauty: beautiful
Location: perfect
Room: gorgeous
City comfort: comfortable
Room bed comfort: comfortable
Bar distance: close

When compared to Renaissance New York Hotel Times Square, Room cleanness is
• better at A Greenwich Village Habitue
• worse at Morningside Inn (34 others)
• similar at Chelsea Inn - 17th Street (86 others)

Better hotels:
• New York City hotels > A Greenwich Village Habitue
• New York City hotels > Union Square Inn
• New York City hotels > Sofitel New York
• New York City hotels > Second Home on Second Avenue
• New York City hotels > The Muse
• New York City hotels > Belleclaire Hotel
• New York City hotels > The St. Regis
• New York City hotels > Kitano New York
• New York City hotels > Milburn Hotel
• New York City hotels > Hotel 41 At Times


TextRunner Extraction


• Extract a triple representing a binary relation (Arg1, Relation, Arg2) from a sentence.

Internet powerhouse, EBay, was originally founded by Pierre Omidyar.

(Ebay, Founded by, Pierre Omidyar)


Numerous Extraction Challenges 


• Drop non-essential info:
  "was originally founded by" → founded by
• Retain key distinctions:
  Ebay founded by Pierre ≠ Ebay founded Pierre
• Non-verb relationships:
  "George Bush, president of the U.S..."
• Synonymy & aliasing:
  Albert Einstein = Einstein ≠ Einstein Bros.


Question Answering (QA) 
from Open-Domain Text 


An idea originating from the IR community 


With massive collections of full-text documents, simply 
finding relevant documents is of limited use: we want answers 
from textbases 


QA: give the user a (short) answer to their question, perhaps 
supported by evidence. 


The common person’s view? [From a novel] 


"I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogota ... I'm the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd."

• M. Marshall. The Straw Men. HarperCollins Publishers, 2002.


People Want to Ask Questions... 


Examples from AltaVista query log 

who invented surf music? 

how to make stink bombs 

where are the snowdens of yesteryear? 

which english translation of the bible is used in official catholic liturgies? 
how to do clayart 

how to copy psx 

how tall is the sears tower? 

Examples from Excite query log (12/1999) 

how can i find someone in texas

where can i find information on puritan religion?

what are the 7 wonders of the world

how can i eliminate stress

What vacuum cleaner does Consumers Guide recommend

Such questions make up around 12-15% of query logs


The Google answer #1


Include question words etc. in your stop-list 
Do standard IR 


Sometimes this (sort of) works: 


Question: Who was the prime minister of 
Australia during the Great Depression? 


Answer: James Scullin (Labor) 1929-31. 


[Screenshot: Google results for "Who was the prime minister of Australia during the Great Depression?". Google reports that the very common words "Who was the of the" were not included in the search. Annotations on the results: page about Curtin (WW II Labor Prime Minister), can deduce answer; page about Chifley (Labor Prime Minister), can deduce answer.]


But often it doesn’t... 


• Question: How much money did IBM spend on advertising in 2002?
• Answer: I dunno, but I'd like to ... ☺


[Screenshot: Google results for "How much money did IBM spend on advertising in 2002?" (results 1 - 10 of about 13,000). Google reports that the very common words "How on in" were not included in the search. Annotations on the results: no relevant info (marketing firm page); no relevant info (magazine page on ad exec); no relevant info; no relevant info (magazine page on MS-IBM).]


How much money did IBM spend on advertising in 2002? 


[Screenshot: a later Google search for the same question (about 7,040,000 results). The top hits are copies of lecture slides that quote this very question, e.g., CS276B "Text Information Retrieval, Mining, and Exploitation" and CS 294-5 "Statistical Natural Language Processing". Annotation: No Answer, 2011.]


The Google answer #2 


Take the question and try to find it as a string on the web 
Return the next sentence on that web page as the answer 


Works brilliantly if this exact question appears as a FAQ 
question, etc. 


Works lousily most of the time 


Reminiscent of the line about monkeys and typewriters 
producing Shakespeare 


But a slightly more sophisticated version of this approach has 
been revived in recent years with considerable success... 
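
The naive strategy can be sketched against a single page of text. This is an illustration only: the sentence splitting is a crude regular expression, exact string matching is assumed, and the FAQ text in the usage note is invented.

```python
import re


def next_sentence_answer(question, page_text):
    """Find the question verbatim in the page; return the sentence after it."""
    pos = page_text.find(question)
    if pos < 0:
        return None   # the exact question string does not occur
    rest = page_text[pos + len(question):].strip()
    if not rest:
        return None
    # Naive sentence boundary: ., ! or ? followed by whitespace or end of text.
    match = re.match(r"(.+?[.!?])(\s|$)", rest, re.DOTALL)
    return match.group(1) if match else rest
```

On an invented FAQ page such as `"FAQ. What are your opening hours? We are open from nine to five. Closed on Sundays."`, the function returns `"We are open from nine to five."` for the exact question and `None` for any rewording of it, which is why the approach fails most of the time.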


A Brief (Academic) History 


In some sense question answering is not a new 
research area 

Question answering systems can be found in many areas of NLP research, including:

• Natural language database systems
  - A lot of early NLP work on these (e.g., LUNAR)
• Spoken dialog systems
  - Currently very active and commercially relevant

The focus on open-domain QA is fairly new
  - MURAX (Kupiec 1993): Encyclopedia answers
  - Hirschman: Reading comprehension tests
  - TREC QA competition: 1999 onward


Question Answering at TREC 


Question answering competition at TREC 

• Until 2004, the task consisted of answering a set of 500 fact-based questions, e.g., "When was Mozart born?"
• For the first three years, systems were allowed to return 5 ranked answer snippets (50/250 bytes) for each question
  - IR think
  - Mean Reciprocal Rank (MRR) scoring:
    • 1, 0.5, 0.33, 0.25, 0.2, 0 for a first correct answer at rank 1, 2, 3, 4, 5, 6+
  - Mainly Named Entity answers (person, place, date, ...)
• From 2002, systems were only allowed to return a single exact answer, and the notion of confidence was introduced
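
MRR scoring as described above can be computed as in this sketch. The function names and the representation of gold answers as a set of acceptable strings are assumptions for illustration; the sample answers are made up apart from Mozart's birth year.

```python
def reciprocal_rank(ranked_answers, gold, max_rank=5):
    """1/rank of the first correct answer within the top max_rank, else 0."""
    for rank, answer in enumerate(ranked_answers[:max_rank], start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_answers, gold_answer_set), one per question."""
    return sum(reciprocal_rank(a, g) for a, g in runs) / len(runs)


runs = [
    (["1756", "1791"], {"1756"}),           # correct at rank 1 -> 1
    (["1791", "1800", "1756"], {"1756"}),   # correct at rank 3 -> 1/3
    (["1791", "1800"], {"1756"}),           # no correct answer -> 0
]
mrr = mean_reciprocal_rank(runs)            # (1 + 1/3 + 0) / 3 = 4/9
```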


[Screenshot: Google results for "when was mozart born?" (about 21,700,000 results). Google shows a best guess, "Mozart Date of birth is January 27, 1756", mentioned on at least 9 websites including wikipedia.org, answers.com and mozartones.com. Annotation: exact answer with context.]


The TREC Document Collection 


The retrieval collection uses news articles from the following sources:

• AP newswire, 1998-2000
• New York Times newswire, 1998-2000
• Xinhua News Agency newswire, 1996-2000

In total there are 1,033,461 documents in the collection (3 GB of text).

This is a lot of text to process entirely using advanced NLP techniques, so systems usually consist of an initial information retrieval phase followed by more advanced processing.

Many supplement this text with use of the web and other knowledge bases.


Sample TREC questions


1. Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?
2. What was the monetary value of the Nobel Peace Prize in 1989?
3. What does the Peugeot company manufacture?
4. How much did Mercury spend on advertising in 1993?
5. What is the name of the managing director of Apricot Computer?
6. Why did David Koresh ask the FBI for a word processor?
7. What debts did Qintex group leave?
8. What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?